gep: GEP-3440 - Gateway API Support for gRPC Retries #3441

Open
wants to merge 6 commits into
base: main

shadialtarsha

What type of PR is this?

/kind gep

What this PR does / why we need it:
Proposes configurations for gRPC retries within GRPCRoute.

Which issue(s) this PR fixes:

Fixes #3440

Does this PR introduce a user-facing change?:

Adds a new field `retry` to `GRPCRouteRule` to allow configuring retries for unsuccessful gRPC requests. 

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/gep PRs related to Gateway Enhancement Proposal(GEP) labels Nov 8, 2024

linux-foundation-easycla bot commented Nov 8, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Nov 8, 2024
@k8s-ci-robot
Contributor

Welcome @shadialtarsha!

It looks like this is your first PR to kubernetes-sigs/gateway-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/gateway-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 8, 2024
@k8s-ci-robot
Contributor

Hi @shadialtarsha. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@shadialtarsha
Author

shadialtarsha commented Nov 8, 2024

@mikemorris already set me up for success here with his HTTP retries GEP. So this GEP is heavily inspired by his work.

I assume this GEP might depend on the outcome of #3219. At the very least, the conformance tests need the timeout in order to verify that the backoff does not exceed it.

By the way, this is my first PR, so I hope I followed the process correctly 😅

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 8, 2024
shadialtarsha and others added 3 commits November 8, 2024 16:05
Co-authored-by: Sotiris Nanopoulos <[email protected]>
Co-authored-by: Sotiris Nanopoulos <[email protected]>
- No standard APIs for advanced retry logic, such as integrating with rate-limiting headers.
- No default retry policies for all routes within a namespace or for routes tied to a specific Gateway.
- No support for detailed backoff adjustments, like fine-tuning intervals, adding jitter, or setting max duration caps.
- No retry support for streaming or bidirectional APIs (may be considered in future proposals).
Contributor


How is this enforced in the API specification?

Author


Thanks for calling that out. The API doesn't have a way to enforce this non-goal.

I am thinking of three ways to do that:

  1. Add a restriction to the API documentation clarifying that retries apply only to unary calls, with a potential future option to expand to streaming. Something along the lines of:
// Note: **Retries are supported only for unary gRPC calls.**
// Implementations MUST NOT apply retries to streaming or bidirectional
// gRPC calls, as these types of calls are stateful and retrying them
// could result in data loss or duplication.
  2. Explicit field: add a UnaryOnly field (e.g., UnaryOnly bool) that makes it clear retries are restricted to unary calls.
  3. Remove this restriction and let users choose whether to apply retries on any gRPC call type.
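
To make the documentation-only option a bit more concrete, here is a rough sketch of how an implementation might honor a unary-only rule when deciding whether to retry a failed call. The names `CallKind`, `RetryPolicy`, and `ShouldRetry` are made up for illustration; they are not part of the GEP or of any gRPC library, and a real data plane would have its own way of detecting streaming calls.

```go
package main

import "fmt"

// CallKind is a hypothetical classification of gRPC call types.
type CallKind int

const (
	Unary CallKind = iota
	ClientStreaming
	ServerStreaming
	BidiStreaming
)

// RetryPolicy is a simplified, hypothetical stand-in for the proposed
// retry configuration on a GRPCRouteRule. It is not the GEP's actual type.
type RetryPolicy struct {
	Reasons  []string // gRPC status code names, e.g. "UNAVAILABLE"
	Attempts int      // maximum number of retries
}

// ShouldRetry reports whether a failed call may be retried under the
// unary-only rule: streaming calls are never retried, the retry budget
// must not be exhausted, and the status must be one of the listed reasons.
func ShouldRetry(p RetryPolicy, kind CallKind, status string, retriesSoFar int) bool {
	if kind != Unary {
		return false
	}
	if retriesSoFar >= p.Attempts {
		return false
	}
	for _, reason := range p.Reasons {
		if reason == status {
			return true
		}
	}
	return false
}

func main() {
	p := RetryPolicy{Reasons: []string{"UNAVAILABLE"}, Attempts: 2}
	fmt.Println("unary:", ShouldRetry(p, Unary, "UNAVAILABLE", 0))
	fmt.Println("bidi:", ShouldRetry(p, BidiStreaming, "UNAVAILABLE", 0))
}
```

With this shape the restriction lives entirely in implementation behavior, so nothing in the API object itself would reject a route that happens to serve streaming methods.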

Would like to hear your thoughts on this.

@robscott
Member

robscott commented Nov 8, 2024

@shadialtarsha Thanks for this PR! We're currently in our release scoping phase for v1.3. To have this considered for scope in v1.3, please propose it in #3403.

Contributor

@kflynn kflynn left a comment


I left some comments but overall I think this is a good thing to do – thanks for starting in on this!! 🙂

Comment on lines +92 to +94
4. **Non-Idempotent Requests** (`non_idempotent`):
By default, Nginx does not retry non-idempotent requests (like POST or PUT) because they can cause side effects
if sent multiple times. However, you can enable retries for non-idempotent requests if needed.
Contributor


Does this imply that you MUST do something special to get NGINX to retry gRPC at all?

2. **Retry Limits**: Traefik provides configurable retry attempts and can set a maximum number of retries. However,
Traefik does not offer per-try timeout controls specific to each retry attempt. Instead, it typically relies on a
global request timeout, limiting the flexibility needed for more precise gRPC retry management (like Envoy’s `per_try_timeout`).

Contributor


Linkerd supports gRPC retry as well: you MUST configure a GRPCRoute for Linkerd to understand that gRPC semantics are desired, but after that you can configure retries either on Routes or Services. See https://linkerd.io/2.17/reference/retries/.

gRPC retries with specialized logic, while other proxies rely on HTTP error codes, lacking the precision needed
for gRPC.

### Go
Contributor


I would really like to always see new API stuff described in mostly-English rather than in Go. I think you're saying this:

We're going to add a `retry` stanza to the GRPCRoute `rule`:

retry:
   reasons: an array of gRPC status code names
   attempts: an optional maximum number of retries, implementation-specific default
   backoff: minimum time between retries as a GEP-2257 Duration, implementation-specific default

All of these are Extended.

I feel like we should always be able to describe new additions like this -- if we really can't easily describe the API in English, we're probably not designing it well in the first place. 🙂
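
For anyone who still wants to see how that English description could map onto code, the following is a hypothetical sketch, not the GEP's actual Go types. The `GRPCRetry` struct and its `Resolve` method are invented for illustration, the default values passed in are placeholders for whatever an implementation chooses, and `time.ParseDuration` is used only as a stand-in parser on the assumption that it accepts a superset of the GEP-2257 Duration format.

```go
package main

import (
	"fmt"
	"time"
)

// GRPCRetry is a hypothetical Go shape for the `retry` stanza described
// above. The field names mirror the stanza, not the GEP's real types.
type GRPCRetry struct {
	Reasons  []string // gRPC status code names, e.g. "UNAVAILABLE"
	Attempts *int     // nil means "use the implementation-specific default"
	Backoff  *string  // duration string such as "250ms"; nil means default
}

// Resolve fills in the supplied defaults for unset fields and parses the
// backoff string. time.ParseDuration is a stand-in here: it is assumed to
// accept more inputs than GEP-2257 permits, so real validation would be
// stricter.
func (r GRPCRetry) Resolve(defAttempts int, defBackoff time.Duration) (int, time.Duration, error) {
	attempts := defAttempts
	if r.Attempts != nil {
		attempts = *r.Attempts
	}
	backoff := defBackoff
	if r.Backoff != nil {
		d, err := time.ParseDuration(*r.Backoff)
		if err != nil {
			return 0, 0, fmt.Errorf("invalid backoff %q: %w", *r.Backoff, err)
		}
		backoff = d
	}
	return attempts, backoff, nil
}

func main() {
	b := "250ms"
	r := GRPCRetry{Reasons: []string{"UNAVAILABLE"}, Backoff: &b}
	// Attempts is unset, so the placeholder default of 2 applies.
	attempts, backoff, err := r.Resolve(2, 100*time.Millisecond)
	fmt.Println(attempts, backoff, err)
}
```

Using pointers for `Attempts` and `Backoff` is one way to tell "unset, use the default" apart from an explicit zero value.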

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shadialtarsha
Once this PR has been reviewed and has the lgtm label, please assign youngnick for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/gep PRs related to Gateway Enhancement Proposal(GEP) needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Retry Policy for GRPCRoute
7 participants